{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install parquetdb\n", "!pip install ipykernel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 03 - Graph Generators in ParquetGraphDB\n", "\n", "In this notebook, we'll learn how to:\n", "\n", "1. Create a node generator\n", "2. Add the node generator to the graph\n", "3. Create an edge generator\n", "4. Add the edge generator to the graph\n", "5. Define dependencies between generators\n", "\n", "We'll use the `ParquetGraphDB` class from `parquetdb` to demonstrate these features. If you haven't already installed `parquetdb`, run the cell above.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Scenario: Modeling Materials Data\n", "\n", "Let's explore how `parquetdb` generators can build and maintain a graph using a materials science scenario. Materials, at their core, are described by their **structure** and the **chemical elements** they contain (their **composition**).\n", "\n", "We can represent this information effectively using a **heterograph**, a graph with more than one type of node:\n", "\n", "* Nodes representing **Materials** (like $H_2O$ or $Fe$).\n", "* Nodes representing **Elements** (like $H$, $O$, $Fe$).\n", "* Edges showing which **Elements** make up which **Materials**.\n", "\n", "The real power of generators becomes apparent when considering how this data originates and evolves. Material definitions might come from external files or databases, and the properties of elements might be sourced separately.\n", "\n", "Generators provide a robust mechanism to:\n", "* **Ingest and process** this source data into graph nodes and edges.\n", "* **Establish dependencies**. For instance, the creation of material-element edges depends on both Material and Element nodes existing first.\n", "* **Automate updates**. 
If the input file defining materials changes, or if an element's properties are updated in its source, generators make it possible for `parquetdb` to rebuild the affected parts of the graph automatically, keeping it consistent.\n", "\n", "We'll now set up this example, starting with the data sources for elements and materials." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloaded periodic table data with 118 elements\n", "Downloaded periodic table data with 1000 elements\n" ] } ], "source": [ "import os\n", "import shutil\n", "import requests\n", "import io\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "import pyarrow as pa\n", "import pyarrow.parquet as pq\n", "\n", "def download_url(url,save_path):\n", "    # Download the parquet file\n", "    response = requests.get(url)\n", "    if response.status_code == 200:\n", "        # Read the downloaded bytes into a PyArrow Table\n", "        parquet_file = io.BytesIO(response.content)\n", "        periodic_table = pq.read_table(parquet_file)\n", "        print(f\"Downloaded periodic table data with {len(periodic_table)} elements\")\n", "    else:\n", "        # Raising a plain string is a TypeError in Python 3, so raise a real exception\n", "        raise RuntimeError(f\"Could not download data from {url}\")\n", "\n", "    # Save the table locally so the generators can read it from disk\n", "    pq.write_table(periodic_table, save_path)\n", "\n", "\n", "FILE_DIR = Path(\".\")\n", "DATA_DIR = FILE_DIR / \"data\"\n", "\n", "if DATA_DIR.exists():\n", "    shutil.rmtree(DATA_DIR)\n", "\n", "DATA_DIR.mkdir(parents=True, exist_ok=True)\n", "\n", "# URL to the raw data file in the GitHub repository\n", "elements_url = \"https://github.com/lllangWV/ParquetDB/raw/GraphDB/tests/graph/data/interim_periodic_table_values.parquet\"\n", "materials_url = \"https://github.com/lllangWV/ParquetDB/raw/GraphDB/tests/graph/data/materials/materials_0.parquet\"\n", "\n", "elements_file = DATA_DIR / \"elements.parquet\"\n", "materials_file = DATA_DIR / \"materials.parquet\"\n", "\n", 
"download_url(elements_url,elements_file)\n", "download_url(materials_url,materials_file)\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pyarrow.Table\n", "long_name: string\n", "symbol: string\n", "abundance_universe: double\n", "abundance_solar: double\n", "abundance_meteor: double\n", "abundance_crust: double\n", "abundance_ocean: double\n", "abundance_human: double\n", "adiabatic_index: string\n", "allotropes: string\n", "appearance: string\n", "atomic_mass: double\n", "atomic_number: int64\n", "block: string\n", "boiling_point: double\n", "classifications_cas_number: string\n", "classifications_cid_number: string\n", "classifications_rtecs_number: string\n", "classifications_dot_numbers: string\n", "classifications_dot_hazard_class: double\n", "conductivity_thermal: double\n", "cpk_hex: string\n", "critical_pressure: double\n", "critical_temperature: double\n", "crystal_structure: string\n", "density_stp: double\n", "discovered_year: int64\n", "discovered_by: string\n", "discovered_location: string\n", "electron_affinity: double\n", "electron_configuration: string\n", "electron_configuration_semantic: string\n", "electronegativity_pauling: double\n", "energy_levels: string\n", "gas_phase: string\n", "group: int64\n", "extended_group: int64\n", "half_life: string\n", "heat_specific: double\n", "heat_vaporization: double\n", "heat_fusion: double\n", "heat_molar: double\n", "isotopes_known: string\n", "isotopes_stable: string\n", "isotopic_abundances: string\n", "lattice_angles: string\n", "lattice_constants: string\n", "lifetime: string\n", "magnetic_susceptibility_mass: double\n", "magnetic_susceptibility_molar: double\n", "magnetic_susceptibility_volume: double\n", "magnetic_type: string\n", "melting_point: double\n", "molar_volume: double\n", "neutron_cross_section: double\n", "neutron_mass_absorption: double\n", "oxidation_states: string\n", "period: int64\n", "phase: 
string\n", "quantum_numbers: string\n", "radius_calculated: double\n", "radius_empirical: double\n", "radius_covalent: double\n", "radius_vanderwaals: double\n", "refractive_index: double\n", "series: string\n", "source: string\n", "space_group_name: string\n", "space_group_number: double\n", "speed_of_sound: double\n", "summary: string\n", "valence_electrons: double\n", "conductivity_electric: double\n", "electrical_resistivity: double\n", "electrical_type: string\n", "modulus_bulk: double\n", "modulus_shear: double\n", "modulus_young: double\n", "poisson_ratio: double\n", "coefficient_of_linear_thermal_expansion: double\n", "hardness_vickers: double\n", "hardness_brinell: double\n", "hardness_mohs: double\n", "superconduction_temperature: double\n", "is_actinoid: bool\n", "is_alkali: bool\n", "is_alkaline: bool\n", "is_chalcogen: bool\n", "is_halogen: bool\n", "is_lanthanoid: bool\n", "is_metal: bool\n", "is_metalloid: bool\n", "is_noble_gas: bool\n", "is_post_transition_metal: bool\n", "is_quadrupolar: bool\n", "is_rare_earth_metal: bool\n", "experimental_oxidation_states: string\n", "ionization_energies: string\n", "----\n", "long_name: [[\"Hydrogen\",\"Helium\",\"Lithium\",\"Beryllium\",\"Boron\",...,\"Flerovium\",\"Moscovium\",\"Livermorium\",\"Tennessine\",\"Oganesson\"]]\n", "symbol: [[\"H\",\"He\",\"Li\",\"Be\",\"B\",...,\"Fl\",\"Mc\",\"Lv\",\"Ts\",\"Og\"]]\n", "abundance_universe: [[75,23,6e-7,1e-7,1e-7,...,0,0,0,0,0]]\n", "abundance_solar: [[75,23,6e-9,1e-8,2e-7,...,0,0,0,0,0]]\n", "abundance_meteor: [[2.4,null,0.00017,0.0000029,0.00016,...,0,0,0,0,0]]\n", "abundance_crust: [[0.15,5.5e-7,0.0017,0.00019,0.00086,...,0,0,0,0,0]]\n", "abundance_ocean: [[11,7.2e-10,0.000018,6e-11,0.00044,...,0,0,0,0,0]]\n", "abundance_human: [[10,null,0.000003,4e-8,0.00007,...,0,0,0,0,0]]\n", "adiabatic_index: [[\"5-Jul\",\"3-May\",null,null,null,...,null,null,null,null,null]]\n", "allotropes: [[\"Dihydrogen\",null,null,null,\"Alpha Rhombohedral Boron, Beta Rhombohedral 
Boron, Alpha Tetragonal Boron\",...,null,null,null,null,null]]\n", "...\n", "pyarrow.Table\n", "bonding.cutoff_method.bond_connections: list>\n", " child 0, element: list\n", " child 0, element: int64\n", "bonding.electric_consistent.bond_connections: list>\n", " child 0, element: list\n", " child 0, element: double\n", "bonding.electric_consistent.bond_orders: list>\n", " child 0, element: list\n", " child 0, element: double\n", "bonding.geometric_consistent.bond_connections: list>\n", " child 0, element: list\n", " child 0, element: double\n", "bonding.geometric_consistent.bond_orders: list>\n", " child 0, element: list\n", " child 0, element: double\n", "bonding.geometric_electric_consistent.bond_connections: list>\n", " child 0, element: list\n", " child 0, element: double\n", "bonding.geometric_electric_consistent.bond_orders: list>\n", " child 0, element: list\n", " child 0, element: double\n", "chargemol.bond_connections: list>\n", " child 0, element: list\n", " child 0, element: double\n", "chargemol.bond_orders: list>\n", " child 0, element: list\n", " child 0, element: double\n", "chargemol.cubed_moments: list\n", " child 0, element: double\n", "chargemol.fourth_moments: list\n", " child 0, element: double\n", "chargemol.squared_moments: list\n", " child 0, element: double\n", "chemenv.coordination_environments_multi_weight: list>>>\n", " child 0, element: list>>\n", " child 0, element: struct>\n", " child 0, ce_fraction: double\n", " child 1, ce_symbol: string\n", " child 2, csm: double\n", " child 3, permutation: list\n", " child 0, element: int64\n", "chemenv.coordination_multi_connections: list>\n", " child 0, element: list\n", " child 0, element: int64\n", "chemenv.coordination_multi_numbers: list\n", " child 0, element: int64\n", "core.atomic_numbers: list\n", " child 0, element: int64\n", "core.cartesian_coords: list>\n", " child 0, element: list\n", " child 0, element: double\n", "core.density: double\n", "core.density_atomic: double\n", 
"core.elements: list\n", " child 0, element: string\n", "core.energy_per_atom: double\n", "core.formula: string\n", "core.formula_pretty: string\n", "core.frac_coords: list>\n", " child 0, element: list\n", " child 0, element: double\n", "core.is_gap_direct: bool\n", "core.is_magnetic: bool\n", "core.is_metal: bool\n", "core.is_stable: bool\n", "core.lattice: extension\n", "core.material_id: string\n", "core.nelements: int64\n", "core.nsites: int64\n", "core.species: list\n", " child 0, element: string\n", "core.volume: double\n", "dielectric.e_electronic: double\n", "dielectric.e_ij_max: double\n", "dielectric.e_ionic: double\n", "dielectric.e_total: double\n", "dielectric.n: double\n", "elasticity.compliance_tensor_ieee_format: extension\n", "elasticity.compliance_tensor_raw: extension\n", "elasticity.debye_temperature: double\n", "elasticity.elastic_tensor_ieee_format: extension\n", "elasticity.elastic_tensor_raw: extension\n", "elasticity.g_reuss: double\n", "elasticity.g_voigt: double\n", "elasticity.g_vrh: double\n", "elasticity.homogeneous_poisson: double\n", "elasticity.k_reuss: double\n", "elasticity.k_voigt: double\n", "elasticity.k_vrh: double\n", "elasticity.order: int64\n", "elasticity.sound_velocity_acoustic: double\n", "elasticity.sound_velocity_longitudinal: double\n", "elasticity.sound_velocity_optical: double\n", "elasticity.sound_velocity_total: double\n", "elasticity.sound_velocity_transverse: double\n", "elasticity.state: string\n", "elasticity.thermal_conductivity_cahill: double\n", "elasticity.thermal_conductivity_clarke: double\n", "elasticity.universal_anisotropy: double\n", "elasticity.warnings: list\n", " child 0, element: string\n", "elasticity.young_modulus: null\n", "electronic_structure.band_gap: double\n", "electronic_structure.cbm: double\n", "electronic_structure.dos_energy_up: null\n", "electronic_structure.efermi: double\n", "electronic_structure.vbm: double\n", "feature_vectors.element_fraction: extension\n", 
"feature_vectors.element_property: extension\n", "feature_vectors.sine_coulomb_matrix: extension\n", "feature_vectors.xrd_pattern: extension\n", "grain_boundaries.grain_boundaries: list>\n", " child 0, element: struct\n", " child 0, gb_energy: double\n", " child 1, rotation_angle: double\n", " child 2, sigma: int64\n", " child 3, type: string\n", "has_props.absorption: bool\n", "has_props.bandstructure: bool\n", "has_props.charge_density: bool\n", "has_props.chemenv: bool\n", "has_props.dielectric: bool\n", "has_props.dos: bool\n", "has_props.elasticity: bool\n", "has_props.electronic_structure: bool\n", "has_props.eos: bool\n", "has_props.grain_boundaries: bool\n", "has_props.insertion_electrodes: bool\n", "has_props.magnetism: bool\n", "has_props.materials: bool\n", "has_props.oxi_states: bool\n", "has_props.phonon: bool\n", "has_props.piezoelectric: bool\n", "has_props.provenance: bool\n", "has_props.substrates: bool\n", "has_props.surface_properties: bool\n", "has_props.thermo: bool\n", "has_props.xas: bool\n", "id: int64\n", "magnetism.num_magnetic_sites: int64\n", "magnetism.num_unique_magnetic_sites: int64\n", "magnetism.ordering: string\n", "magnetism.total_magnetization: double\n", "magnetism.total_magnetization_normalized_vol: double\n", "magnetism.types_of_magnetic_species: list\n", " child 0, element: string\n", "metadata.last_updated: string\n", "metadata.theoretical: bool\n", "oxidation_states.method: string\n", "oxidation_states.possible_species: list\n", " child 0, element: string\n", "oxidation_states.possible_valences: list\n", " child 0, element: double\n", "structure.@class: string\n", "structure.@module: string\n", "structure.charge: double\n", "structure.lattice.a: double\n", "structure.lattice.alpha: double\n", "structure.lattice.b: double\n", "structure.lattice.beta: double\n", "structure.lattice.c: double\n", "structure.lattice.gamma: double\n", "structure.lattice.matrix: extension\n", "structure.lattice.pbc: extension\n", 
"structure.lattice.volume: double\n", "structure.sites: list, label: string, properties: struct, species: list>, xyz: list>>\n", " child 0, element: struct, label: string, properties: struct, species: list>, xyz: list>\n", " child 0, abc: list\n", " child 0, element: double\n", " child 1, label: string\n", " child 2, properties: struct\n", " child 0, magmom: double\n", " child 3, species: list>\n", " child 0, element: struct\n", " child 0, element: string\n", " child 1, occu: int64\n", " child 4, xyz: list\n", " child 0, element: double\n", "surface_properties.shape_factor: double\n", "surface_properties.surface_anisotropy: double\n", "surface_properties.weighted_surface_energy: double\n", "surface_properties.weighted_surface_energy_EV_PER_ANG2: double\n", "surface_properties.weighted_work_function: double\n", "symmetry.crystal_system: string\n", "symmetry.number: int64\n", "symmetry.point_group: string\n", "symmetry.symbol: string\n", "symmetry.symprec: double\n", "symmetry.version: string\n", "symmetry.wyckoffs: list\n", " child 0, element: string\n", "thermo.decomposes_to: list>\n", " child 0, element: struct\n", " child 0, amount: double\n", " child 1, formula: string\n", " child 2, material_id: string\n", "thermo.energy_above_hull: double\n", "thermo.equilibrium_reaction_energy_per_atom: double\n", "thermo.formation_energy_per_atom: double\n", "thermo.uncorrected_energy_per_atom: double\n", "----\n", "bonding.cutoff_method.bond_connections: [[null,null,...,null,null]]\n", "bonding.electric_consistent.bond_connections: [[[[],[2,2,3,3,4,4],[1,1],[1,1],[1,1]],[[],[],...,[2],[2]],...,[[8,9,10,11,12,16,16,19,19],[8,9,10,11,13,17,17,18,18],...,[1,1,2,2,4,7,13,13,14],[0,0,3,3,5,6,12,12,15]],null]]\n", "bonding.electric_consistent.bond_orders: 
[[[[],[0.6945,0.6945,0.6945,0.6945,0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945]],[[],[],...,[0.5889],[0.5889]],...,[[0.2795,0.2883,0.2883,0.2795,0.2481,0.307,0.3419,0.311,0.311],[0.2883,0.2795,0.2795,0.2883,0.2481,0.3419,0.307,0.311,0.311],...,[0.311,0.311,0.307,0.3419,0.5776,0.5776,0.1163,0.1163,0.4588],[0.311,0.311,0.3419,0.307,0.5776,0.5776,0.1163,0.1163,0.4588]],null]]\n", "bonding.geometric_consistent.bond_connections: [[[[2,2,2,2,3,...,3,4,4,4,4],[2,2,3,3,4,4],[1,1],[1,1],[1,1]],[[6,7,12,13,10,11],[4,5,14,15,8,9],...,[2,1],[2,1]],...,[[16,19,19,16,9,10,8,11,12],[17,18,18,17,8,11,9,10,13],...,[4,7,14,2,1,1,2],[5,6,15,3,0,0,3]],null]]\n", "bonding.geometric_consistent.bond_orders: [[[[0.0734,0.0734,0.0734,0.0734,0.0734,...,0.0734,0.0734,0.0734,0.0734,0.0734],[0.6945,0.6945,0.6945,0.6945,0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945]],[[0.0842,0.0842,0.0707,0.0707,0.0619,0.0619],[0.0842,0.0842,0.0707,0.0707,0.0619,0.0619],...,[0.5889,0.0707],[0.5889,0.0707]],...,[[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],...,[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307],[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307]],null]]\n", "bonding.geometric_electric_consistent.bond_connections: [[[[],[2,2,3,3,4,4],[1,1],[1,1],[1,1]],[[],[],...,[2],[2]],...,[[16,19,19,16,9,10,8,11,12],[17,18,18,17,8,11,9,10,13],...,[4,7,14,2,1,1,2],[5,6,15,3,0,0,3]],null]]\n", "bonding.geometric_electric_consistent.bond_orders: [[[[],[0.6945,0.6945,0.6945,0.6945,0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945]],[[],[],...,[0.5889],[0.5889]],...,[[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],...,[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307],[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307]],null]]\n", "chargemol.bond_connections: 
[[[[0,0,0,0,0,...,3,4,4,4,4],[0,0,0,0,0,...,2,3,3,4,4],[0,0,0,0,1,...,3,4,4,4,4],[0,0,0,0,1,...,3,4,4,4,4],[0,0,0,0,1,...,3,4,4,4,4]],[[2,3,3,3,3,...,9,10,11,12,13],[2,2,2,2,3,...,9,10,11,14,15],...,[1,2,2,2,3,...,9,11,15,15,15],[1,2,2,2,3,...,9,10,14,14,14]],...,[[3,3,3,3,5,...,15,16,16,19,19],[2,2,2,2,4,...,14,17,17,18,18],...,[1,1,2,2,4,...,14,17,17,17,17],[0,0,3,3,4,...,15,16,16,16,16]],null]]\n", "chargemol.bond_orders: [[[[0.0033,0.0033,0.0033,0.0033,0.0033,...,0.0734,0.0734,0.0734,0.0734,0.0734],[0.0155,0.0155,0.0155,0.0155,0.0155,...,0.6945,0.6945,0.6945,0.6945,0.6945],[0.0734,0.0734,0.0734,0.0734,0.6945,...,0.0347,0.0347,0.0347,0.0347,0.0347],[0.0734,0.0734,0.0734,0.0734,0.6945,...,0.0017,0.0347,0.0347,0.0347,0.0347],[0.0734,0.0734,0.0734,0.0734,0.6945,...,0.0347,0.0017,0.0017,0.0017,0.0017]],[[0.0049,0.0011,0.0011,0.0011,0.0011,...,0.002,0.0619,0.0619,0.0707,0.0707],[0.0011,0.0011,0.0011,0.0011,0.0049,...,0.0619,0.002,0.002,0.0707,0.0707],...,[0.0707,0.5889,0.0013,0.0013,0.0046,...,0.0271,0.0435,0.0076,0.0076,0.0438],[0.0707,0.0013,0.0013,0.5889,0.0046,...,0.0259,0.0435,0.0076,0.0076,0.0438]],...,[[0.0167,0.0179,0.0167,0.0179,0.0612,...,0.002,0.307,0.3419,0.311,0.311],[0.0179,0.0167,0.0179,0.0167,0.0612,...,0.002,0.3419,0.307,0.311,0.311],...,[0.311,0.311,0.307,0.3419,0.5776,...,0.4588,0.0302,0.0405,0.0302,0.0405],[0.311,0.311,0.3419,0.307,0.0043,...,0.0265,0.0405,0.0302,0.0405,0.0302]],null]]\n", "chargemol.cubed_moments: [[[71.579692,117.080036,41.119066,41.119066,41.119066],[4.090622,4.090516,41.32723,41.326694,22.622699,...,22.349214,22.449562,22.449562,22.449615,22.449615],...,[144.779124,144.779125,144.779138,144.779139,82.876595,...,168.458199,194.520457,194.520458,194.520409,194.52041],null]]\n", "...\n" ] } ], "source": [ "elements_table = pq.read_table(elements_file)\n", "materials_table = pq.read_table(materials_file)\n", "print(elements_table)\n", "print(materials_table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we 
load the materials data into `ParquetGraphDB`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "GRAPH DATABASE SUMMARY\n", "============================================================\n", "Name: GraphDB\n", "Storage path: data\\GraphDB\n", "└── Repository structure:\n", " ├── nodes/ (data\\GraphDB\\nodes)\n", " ├── edges/ (data\\GraphDB\\edges)\n", " ├── edge_generators/ (data\\GraphDB\\edge_generators)\n", " ├── node_generators/ (data\\GraphDB\\node_generators)\n", " └── graph/ (data\\GraphDB\\graph)\n", "\n", "############################################################\n", "NODE DETAILS\n", "############################################################\n", "Total node types: 1\n", "------------------------------------------------------------\n", "• Node type: material\n", " - Number of nodes: 1000\n", " - Number of features: 136\n", " - Columns:\n", " - bonding.cutoff_method.bond_connections\n", " - bonding.electric_consistent.bond_connections\n", " - bonding.electric_consistent.bond_orders\n", " - bonding.geometric_consistent.bond_connections\n", " - bonding.geometric_consistent.bond_orders\n", " - bonding.geometric_electric_consistent.bond_connections\n", " - bonding.geometric_electric_consistent.bond_orders\n", " - chargemol.bond_connections\n", " - chargemol.bond_orders\n", " - chargemol.cubed_moments\n", " - chargemol.fourth_moments\n", " - chargemol.squared_moments\n", " - chemenv.coordination_environments_multi_weight\n", " - chemenv.coordination_multi_connections\n", " - chemenv.coordination_multi_numbers\n", " - core.atomic_numbers\n", " - core.cartesian_coords\n", " - core.density\n", " - core.density_atomic\n", " - core.elements\n", " - core.energy_per_atom\n", " - core.formula\n", " - core.formula_pretty\n", " - core.frac_coords\n", " - core.is_gap_direct\n", " - core.is_magnetic\n", " - 
core.is_metal\n", " - core.is_stable\n", " - core.lattice\n", " - core.material_id\n", " - core.nelements\n", " - core.nsites\n", " - core.species\n", " - core.volume\n", " - dielectric.e_electronic\n", " - dielectric.e_ij_max\n", " - dielectric.e_ionic\n", " - dielectric.e_total\n", " - dielectric.n\n", " - elasticity.compliance_tensor_ieee_format\n", " - elasticity.compliance_tensor_raw\n", " - elasticity.debye_temperature\n", " - elasticity.elastic_tensor_ieee_format\n", " - elasticity.elastic_tensor_raw\n", " - elasticity.g_reuss\n", " - elasticity.g_voigt\n", " - elasticity.g_vrh\n", " - elasticity.homogeneous_poisson\n", " - elasticity.k_reuss\n", " - elasticity.k_voigt\n", " - elasticity.k_vrh\n", " - elasticity.order\n", " - elasticity.sound_velocity_acoustic\n", " - elasticity.sound_velocity_longitudinal\n", " - elasticity.sound_velocity_optical\n", " - elasticity.sound_velocity_total\n", " - elasticity.sound_velocity_transverse\n", " - elasticity.state\n", " - elasticity.thermal_conductivity_cahill\n", " - elasticity.thermal_conductivity_clarke\n", " - elasticity.universal_anisotropy\n", " - elasticity.warnings\n", " - elasticity.young_modulus\n", " - electronic_structure.band_gap\n", " - electronic_structure.cbm\n", " - electronic_structure.dos_energy_up\n", " - electronic_structure.efermi\n", " - electronic_structure.vbm\n", " - feature_vectors.element_fraction\n", " - feature_vectors.element_property\n", " - feature_vectors.sine_coulomb_matrix\n", " - feature_vectors.xrd_pattern\n", " - grain_boundaries.grain_boundaries\n", " - has_props.absorption\n", " - has_props.bandstructure\n", " - has_props.charge_density\n", " - has_props.chemenv\n", " - has_props.dielectric\n", " - has_props.dos\n", " - has_props.elasticity\n", " - has_props.electronic_structure\n", " - has_props.eos\n", " - has_props.grain_boundaries\n", " - has_props.insertion_electrodes\n", " - has_props.magnetism\n", " - has_props.materials\n", " - has_props.oxi_states\n", " - 
has_props.phonon\n", " - has_props.piezoelectric\n", " - has_props.provenance\n", " - has_props.substrates\n", " - has_props.surface_properties\n", " - has_props.thermo\n", " - has_props.xas\n", " - id\n", " - magnetism.num_magnetic_sites\n", " - magnetism.num_unique_magnetic_sites\n", " - magnetism.ordering\n", " - magnetism.total_magnetization\n", " - magnetism.total_magnetization_normalized_vol\n", " - magnetism.types_of_magnetic_species\n", " - metadata.last_updated\n", " - metadata.theoretical\n", " - oxidation_states.method\n", " - oxidation_states.possible_species\n", " - oxidation_states.possible_valences\n", " - structure.@class\n", " - structure.@module\n", " - structure.charge\n", " - structure.lattice.a\n", " - structure.lattice.alpha\n", " - structure.lattice.b\n", " - structure.lattice.beta\n", " - structure.lattice.c\n", " - structure.lattice.gamma\n", " - structure.lattice.matrix\n", " - structure.lattice.pbc\n", " - structure.lattice.volume\n", " - structure.sites\n", " - surface_properties.shape_factor\n", " - surface_properties.surface_anisotropy\n", " - surface_properties.weighted_surface_energy\n", " - surface_properties.weighted_surface_energy_EV_PER_ANG2\n", " - surface_properties.weighted_work_function\n", " - symmetry.crystal_system\n", " - symmetry.number\n", " - symmetry.point_group\n", " - symmetry.symbol\n", " - symmetry.symprec\n", " - symmetry.version\n", " - symmetry.wyckoffs\n", " - thermo.decomposes_to\n", " - thermo.energy_above_hull\n", " - thermo.equilibrium_reaction_energy_per_atom\n", " - thermo.formation_energy_per_atom\n", " - thermo.uncorrected_energy_per_atom\n", " - db_path: data\\GraphDB\\nodes\\material\n", "------------------------------------------------------------\n", "\n", "############################################################\n", "EDGE DETAILS\n", "############################################################\n", "Total edge types: 0\n", "------------------------------------------------------------\n", "\n", 
"############################################################\n", "NODE GENERATOR DETAILS\n", "############################################################\n", "Total node generators: 0\n", "------------------------------------------------------------\n", "\n", "############################################################\n", "EDGE GENERATOR DETAILS\n", "############################################################\n", "Total edge generators: 0\n", "------------------------------------------------------------\n", "\n" ] } ], "source": [ "from parquetdb import ParquetGraphDB\n", "\n", "# Create a temporary directory for our database\n", "GRAPH_DB_DIR = DATA_DIR / \"GraphDB\"\n", "if GRAPH_DB_DIR.exists():\n", " shutil.rmtree(GRAPH_DB_DIR)\n", "GRAPH_DB_DIR.mkdir(parents=True, exist_ok=True)\n", "\n", "\n", "# Initialize ParquetGraphDB\n", "db = ParquetGraphDB(storage_path=GRAPH_DB_DIR)\n", "\n", "# The data has an previous id column, we have to remove it\n", "data = pq.read_table(materials_file)\n", "data = data.drop_columns(\"id\")\n", "db.add_nodes(node_type=\"material\", data=data)\n", "\n", "print(db.summary(show_column_names=True))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generators\n", "\n", "A **Generator** is a callable (function) that returns a [PyArrow Table](https://arrow.apache.org/docs/python/api/table.html) of either nodes or edges. By adding a generator to `ParquetGraphDB`, you can:\n", "\n", "1. Register the generator, so it can be re-run on demand.\n", "2. Optionally specify arguments/kwargs to pass into the generator.\n", "3. Automatically store the output in a **NodeStore** or **EdgeStore** with the same name as the generator function (or a custom name, if you prefer).\n", "\n", "This is especially handy for generating nodes from external data sources or from computational routines.\n", "\n", "In the following sections we will create custom node and edge generators. 
These can be created by wrapping existing functions with the `node_generator` or `edge_generator` decorator.\n", "\n", "Both decorators can be imported from `parquetdb`:\n", "\n", "```python\n", "from parquetdb import node_generator, edge_generator\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Element Node Generator\n", "\n", "\n", "#### 1. Define the Generator\n", "\n", "In our first example, we will create a node generator that produces element nodes.\n", "\n", "As mentioned above, we create a node generator by wrapping an existing function with the `node_generator` decorator. The function name becomes the name of the node type.\n", "\n", "```python\n", "@node_generator\n", "def element():\n", "    ...\n", "```\n", "\n", "For this example, we will load the periodic table data downloaded earlier: a DataFrame with 118 rows, one per element. The generator also applies a few transformations, such as parsing the string-encoded `oxidation_states` and `ionization_energies` columns into lists, to make the data more useful for our purposes." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " long_name symbol abundance_universe abundance_solar \\\n", "0 Hydrogen H 7.500000e+01 7.500000e+01 \n", "1 Helium He 2.300000e+01 2.300000e+01 \n", "2 Lithium Li 6.000000e-07 6.000000e-09 \n", "3 Beryllium Be 1.000000e-07 1.000000e-08 \n", "4 Boron B 1.000000e-07 2.000000e-07 \n", ".. ... ... ... ... 
\n", "113 Flerovium Fl 0.000000e+00 0.000000e+00 \n", "114 Moscovium Mc 0.000000e+00 0.000000e+00 \n", "115 Livermorium Lv 0.000000e+00 0.000000e+00 \n", "116 Tennessine Ts 0.000000e+00 0.000000e+00 \n", "117 Oganesson Og 0.000000e+00 0.000000e+00 \n", "\n", " abundance_meteor abundance_crust abundance_ocean abundance_human \\\n", "0 2.400000 1.500000e-01 1.100000e+01 1.000000e+01 \n", "1 NaN 5.500000e-07 7.200000e-10 NaN \n", "2 0.000170 1.700000e-03 1.800000e-05 3.000000e-06 \n", "3 0.000003 1.900000e-04 6.000000e-11 4.000000e-08 \n", "4 0.000160 8.600000e-04 4.400000e-04 7.000000e-05 \n", ".. ... ... ... ... \n", "113 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 \n", "114 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 \n", "115 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 \n", "116 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 \n", "117 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 \n", "\n", " adiabatic_index allotropes ... \\\n", "0 5-Jul Dihydrogen ... \n", "1 3-May None ... \n", "2 None None ... \n", "3 None None ... \n", "4 None Alpha Rhombohedral Boron, Beta Rhombohedral Bo... ... \n", ".. ... ... ... \n", "113 None None ... \n", "114 None None ... \n", "115 None None ... \n", "116 None None ... \n", "117 None None ... \n", "\n", " is_halogen is_lanthanoid is_metal is_metalloid is_noble_gas \\\n", "0 False False False False False \n", "1 False False False False True \n", "2 False False True False False \n", "3 False False True False False \n", "4 False False False True False \n", ".. ... ... ... ... ... \n", "113 False False False False False \n", "114 False False False False False \n", "115 False False False False False \n", "116 False False False False False \n", "117 False False False False True \n", "\n", " is_post_transition_metal is_quadrupolar is_rare_earth_metal \\\n", "0 False True False \n", "1 False False False \n", "2 False True False \n", "3 False True False \n", "4 False True False \n", ".. ... ... ... 
\n", "113 False False False \n", "114 False False False \n", "115 False False False \n", "116 False False False \n", "117 False False False \n", "\n", " experimental_oxidation_states ionization_energies \n", "0 [] [1312.0] \n", "1 [] [2372.3, 5250.5] \n", "2 [1] [520.2, 7298.1, 11815.0] \n", "3 [2] [899.5, 1757.1, 14848.7, 21006.6] \n", "4 [3] [800.6, 2427.1, 3659.7, 25025.8, 32826.7] \n", ".. ... ... \n", "113 [2] [832.2, 1600.0, 3370.0, 4400.0, 5850.0] \n", "114 [3] [538.3, 1760.0, 2650.0, 4680.0, 5720.0] \n", "115 [-2] [663.9, 1330.0, 2850.0, 3810.0, 6080.0] \n", "116 [-1] [736.9, 1435.4, 2161.9, 4012.9, 5076.4] \n", "117 [] [860.1, 1560.0] \n", "\n", "[118 rows x 98 columns]\n" ] } ], "source": [ "### Element Node Generator\n", "import ast\n", "\n", "from parquetdb import node_generator\n", "\n", "# Define the generator with the @node_generator decorator\n", "@node_generator\n", "def element(base_file=elements_file):\n", "    \"\"\"\n", "    Creates Element nodes from a local file (CSV or Parquet).\n", "    Returns a Pandas DataFrame (or PyArrow Table) with one row per element.\n", "    \"\"\"\n", "\n", "    try:\n", "        # Read the file; the extension is e.g. \"parquet\" or \"csv\"\n", "        file_ext = os.path.splitext(base_file)[-1][1:].lower()\n", "        if file_ext == \"parquet\":\n", "            df = pd.read_parquet(base_file)\n", "        elif file_ext == \"csv\":\n", "            df = pd.read_csv(base_file)\n", "        else:\n", "            raise ValueError(\"base_file must be a parquet or csv file\")\n", "\n", "        # Example transformations: parse list-valued columns stored as strings.\n", "        # ast.literal_eval only evaluates Python literals, so it is safer than eval.\n", "        df[\"oxidation_states\"] = df[\"oxidation_states\"].apply(\n", "            lambda x: x.replace(\"]\", \"\").replace(\"[\", \"\")\n", "        )\n", "        df[\"oxidation_states\"] = df[\"oxidation_states\"].apply(\n", "            lambda x: \",\".join(x.split())\n", "        )\n", "        df[\"oxidation_states\"] = df[\"oxidation_states\"].apply(\n", "            lambda x: ast.literal_eval(\"[\" + x + \"]\")\n", "        )\n", "        df[\"experimental_oxidation_states\"] = df[\"experimental_oxidation_states\"].apply(\n", "            ast.literal_eval\n", "        )\n", "        df[\"ionization_energies\"] = df[\"ionization_energies\"].apply(ast.literal_eval)\n", "\n", "    except Exception as e:\n", "        print(f\"Error reading element file: {e}\")\n", "        return None\n", "\n", "    return df  # Return the transformed dataframe\n", "\n", "\n", "df = element()\n", "\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Add the Generator to the ParquetGraphDB\n", "\n", "Now that we have defined the generator, we can add it to the `ParquetGraphDB` instance. We do this by calling the `add_node_generator` method, passing the generator function, its arguments (`generator_args`), and its keyword arguments (`generator_kwargs`). The `run_immediately` flag controls whether the generator runs right away or later. 
The default is `True`.\n", "\n", "The node generator will be stored in the `node_generator_store` of the `ParquetGraphDB` instance.\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "GENERATOR STORE SUMMARY\n", "============================================================\n", "• Number of generators: 1\n", "Storage path: data\\GraphDB\\node_generators\n", "\n", "\n", "############################################################\n", "METADATA\n", "############################################################\n", "• class: GeneratorStore\n", "• class_module: parquetdb.graph.generator_store\n", "\n", "############################################################\n", "GENERATOR DETAILS\n", "############################################################\n", "• Columns:\n", " - generator_func\n", " - generator_kwargs.base_file\n", " - generator_name\n", " - id\n", "\n", "• Generator names:\n", " - element\n", "\n" ] } ], "source": [ "db.add_node_generator(\n", "    generator_func=element,\n", "    generator_args={},\n", "    generator_kwargs={\"base_file\": elements_file},\n", "    run_immediately=False,  # Defer execution; the default is True.\n", ")\n", "\n", "# Check the node generators in the ParquetGraphDB\n", "\n", "print(db.node_generator_store)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running a Node Generator Later\n", "\n", "Now we can run the node generator with `db.run_node_generator(generator_name)`.\n", "\n", "> `Note:` We do not need to pass the arguments or kwargs again, because they are stored in the node generator store.\n", "> However, we can override the arguments or kwargs if we want to."
 ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "table = db.run_node_generator(\"element\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the node store for the elements.\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "NODE STORE SUMMARY\n", "============================================================\n", "Node type: element\n", "• Number of nodes: 118\n", "• Number of features: 99\n", "Storage path: data\\GraphDB\\nodes\\element\n", "\n", "\n", "############################################################\n", "METADATA\n", "############################################################\n", "• class: NodeStore\n", "• class_module: parquetdb.graph.nodes\n", "• node_type: element\n", "• name_column: id\n", "\n", "############################################################\n", "NODE DETAILS\n", "############################################################\n", "\n" ] } ], "source": [ "element_node_store = db.get_node_store(\"element\")\n", "print(element_node_store)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Material-Element Edge Generator\n", "\n", "#### 1. Define the Generator\n", "\n", "An **edge generator** is similar to a node generator but returns a table describing edges (a PyArrow Table or, as in the example here, a pandas DataFrame). Each generated edge must have at least these fields:\n", "\n", "- `source_id` (int)\n", "- `source_type` (string)\n", "- `target_id` (int)\n", "- `target_type` (string)\n", "\n", "Additionally, an edge generator must take the corresponding node stores of the `ParquetGraphDB` instance as arguments. This ensures that the node ids are valid and come from the correct node store. 
\n", "\n", "For edges we use the `edge_generator` decorator.\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from parquetdb import edge_generator\n", "import pyarrow as pa\n", "\n", "\n", "@edge_generator\n", "def material_element_has(\n", "    material_store, element_store\n", "):  # The material and element node stores are passed in as arguments\n", "    try:\n", "        connection_name = \"has\"\n", "\n", "        # We select only the necessary columns from the node stores\n", "        material_table = material_store.read_nodes(\n", "            columns=[\"id\", \"core.material_id\", \"core.elements\"]\n", "        )\n", "        element_table = element_store.read_nodes(columns=[\"id\", \"symbol\"])\n", "\n", "        # We rename for utility purposes\n", "        material_table = material_table.rename_columns(\n", "            {\"id\": \"source_id\", \"core.material_id\": \"material_name\"}\n", "        )\n", "        material_table = material_table.append_column(\n", "            \"source_type\", pa.array([\"material\"] * material_table.num_rows)\n", "        )\n", "\n", "        element_table = element_table.rename_columns({\"id\": \"target_id\"})\n", "        element_table = element_table.append_column(\n", "            \"target_type\", pa.array([\"element\"] * element_table.num_rows)\n", "        )\n", "\n", "        # We convert the tables to pandas for easier manipulation\n", "        material_df = material_table.to_pandas()\n", "        element_df = element_table.to_pandas()\n", "\n", "        # We create a map of the element symbols to the target_id for quick lookup\n", "        element_target_id_map = {\n", "            row[\"symbol\"]: row[\"target_id\"] for _, row in element_df.iterrows()\n", "        }\n", "\n", "        # We create a dictionary to store the edge data\n", "        table_dict = {\n", "            \"source_id\": [],\n", "            \"source_type\": [],\n", "            \"target_id\": [],\n", "            \"target_type\": [],\n", "            \"edge_type\": [],\n", "            \"name\": [],\n", "            \"weight\": [],\n", "        }\n", "\n", "        # We iterate over the material nodes\n", "        for _, row in material_df.iterrows():\n", "            # We get the elements composing the material\n", "            elements = row[\"core.elements\"]\n", "            source_id = row[\"source_id\"]\n", "            material_name = row[\"material_name\"]\n", "            if elements is None:\n", "                continue\n", "\n", "            # We iterate over the elements\n", "            for element in elements:\n", "                # We get the target_id for the element\n", "                target_id = element_target_id_map[element]\n", "\n", "                # We append the edge data to the dictionary. Here we could also define the reverse edge.\n", "                table_dict[\"source_id\"].append(source_id)\n", "                table_dict[\"source_type\"].append(material_store.node_type)\n", "                table_dict[\"target_id\"].append(target_id)\n", "                table_dict[\"target_type\"].append(element_store.node_type)\n", "                table_dict[\"edge_type\"].append(connection_name)\n", "\n", "                name = f\"{material_name}_{connection_name}_{element}\"\n", "                table_dict[\"name\"].append(name)\n", "                table_dict[\"weight\"].append(1.0)\n", "\n", "        df = pd.DataFrame(table_dict)\n", "    except Exception as e:\n", "        print(f\"Error creating material-element-has relationships: {e}\")\n", "        raise  # Re-raise so that we never reach `return df` with `df` undefined\n", "\n", "    return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Add the Generator to the ParquetGraphDB\n", "\n", "Now that we have defined the generator, we can add it to the `ParquetGraphDB` instance. We do this by calling the `add_edge_generator` method.\n", "\n", "The edge generator will be stored in the `edge_generator_store` of the `ParquetGraphDB` instance.\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "element_store = db.get_node_store(\"element\")\n", "material_store = db.get_node_store(\"material\")\n", "\n", "db.add_edge_generator(\n", "    generator_func=material_element_has,\n", "    generator_args={\n", "        \"material_store\": material_store,\n", "        \"element_store\": element_store,\n", "    },\n", "    generator_kwargs={},\n", "    run_immediately=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the edge generator store."
 ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "GENERATOR STORE SUMMARY\n", "============================================================\n", "• Number of generators: 1\n", "Storage path: data\\GraphDB\\edge_generators\n", "\n", "\n", "############################################################\n", "METADATA\n", "############################################################\n", "• class: GeneratorStore\n", "• class_module: parquetdb.graph.generator_store\n", "\n", "############################################################\n", "GENERATOR DETAILS\n", "############################################################\n", "• Columns:\n", " - generator_args.element_store\n", " - generator_args.material_store\n", " - generator_func\n", " - generator_name\n", " - id\n", "\n", "• Generator names:\n", " - material_element_has\n", "\n" ] } ], "source": [ "print(db.edge_generator_store)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check whether the edge generator created the edges in the edge store."
 ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "EDGE STORE SUMMARY\n", "============================================================\n", "Edge type: material_element_has\n", "• Number of edges: 3348\n", "• Number of features: 8\n", "Storage path: data\\GraphDB\\edges\\material_element_has\n", "\n", "\n", "############################################################\n", "METADATA\n", "############################################################\n", "• class: EdgeStore\n", "• class_module: parquetdb.graph.edges\n", "\n", "############################################################\n", "EDGE DETAILS\n", "############################################################\n", "\n" ] } ], "source": [ "edge_store = db.get_edge_store(\"material_element_has\")\n", "print(edge_store)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Updates to Node Stores\n", "\n", "By default, when node and edge generators are added, the stores passed as their arguments are registered as dependencies in the `ParquetGraphDB` instance. 
This means that when a parent store is updated, the dependent generators run again and update their corresponding stores.\n", "\n", "These dependencies are stored in the `ParquetGraphDB/generator_dependency.json` file.\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " id\n", "0 0\n", "Empty DataFrame\n", "Columns: [id]\n", "Index: []\n" ] } ], "source": [ "# Confirm that the material node with id=0 exists\n", "materials_df = db.read_nodes(node_type=\"material\", columns=[\"id\"], ids=[0]).to_pandas()\n", "print(materials_df)\n", "\n", "# Delete it from the parent (material) node store\n", "db.delete_nodes(node_type=\"material\", ids=[0])\n", "\n", "# Reading it again should now return an empty table\n", "materials_df = db.read_nodes(node_type=\"material\", columns=[\"id\"], ids=[0]).to_pandas()\n", "print(materials_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the material node with `id=0` is now gone.\n", "\n", "Let's check the `material_element_has` edges to see if the update has been propagated." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "EDGE STORE SUMMARY\n", "============================================================\n", "Edge type: material_element_has\n", "• Number of edges: 3345\n", "• Number of features: 8\n", "Storage path: data\\GraphDB\\edges\\material_element_has\n", "\n", "\n", "############################################################\n", "METADATA\n", "############################################################\n", "• class: EdgeStore\n", "• class_module: parquetdb.graph.edges\n", "\n", "############################################################\n", "EDGE DETAILS\n", "############################################################\n", "\n" ] } ], "source": [ "edge_store = db.get_edge_store(\"material_element_has\")\n", "print(edge_store)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now there are 3345 `material_element_has` edges 
which is down from 3348 before the deletion.\n", "\n", "Let's check the `material_element_has` dataframe." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " edge_type id name source_id source_type target_id \\\n", "0 has 0 mp-1222351_has_F 1 material 8 \n", "1 has 1 mp-1222351_has_Fe 1 material 25 \n", "2 has 2 mp-1222351_has_Li 1 material 2 \n", "3 has 3 mp-651087_has_F 2 material 8 \n", "4 has 4 mp-651087_has_Gd 2 material 63 \n", "... ... ... ... ... ... ... \n", "3340 has 3340 mp-2714707_has_Al 999 material 12 \n", "3341 has 3341 mp-2714707_has_Na 999 material 10 \n", "3342 has 3342 mp-2714707_has_O 999 material 7 \n", "3343 has 3343 mp-2714707_has_S 999 material 15 \n", "3344 has 3344 mp-2714707_has_Zn 999 material 29 \n", "\n", " target_type weight \n", "0 element 1.0 \n", "1 element 1.0 \n", "2 element 1.0 \n", "3 element 1.0 \n", "4 element 1.0 \n", "... ... ... \n", "3340 element 1.0 \n", "3341 element 1.0 \n", "3342 element 1.0 \n", "3343 element 1.0 \n", "3344 element 1.0 \n", "\n", "[3345 rows x 8 columns]\n" ] } ], "source": [ "df = edge_store.read_edges().to_pandas()\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, no edge has `source_id=0` anymore.\n", "\n", "We can double-check this with the following:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [edge_type, id, name, source_id, source_type, target_id, target_type, weight]\n", "Index: []" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"source_id\"] == 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is empty, as expected." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "============================================================\n", "GRAPH DATABASE SUMMARY\n", "============================================================\n", "Name: GraphDB\n", "Storage path: data\\GraphDB\n", "└── Repository structure:\n", " ├── nodes/ (data\\GraphDB\\nodes)\n", " ├── edges/ (data\\GraphDB\\edges)\n", " ├── edge_generators/ (data\\GraphDB\\edge_generators)\n", " ├── node_generators/ (data\\GraphDB\\node_generators)\n", " └── graph/ (data\\GraphDB\\graph)\n", "\n", "############################################################\n", "NODE DETAILS\n", "############################################################\n", "Total node types: 2\n", "------------------------------------------------------------\n", "• Node type: material\n", " - Number of nodes: 999\n", " - Number of features: 136\n", " - db_path: data\\GraphDB\\nodes\\material\n", "------------------------------------------------------------\n", "• Node type: element\n", " - Number of nodes: 118\n", " - Number of features: 99\n", " - db_path: data\\GraphDB\\nodes\\element\n", "------------------------------------------------------------\n", "\n", "############################################################\n", "EDGE DETAILS\n", "############################################################\n", "Total edge types: 1\n", "------------------------------------------------------------\n", "• Edge type: material_element_has\n", " - Number of edges: 3345\n", " - Number of features: 8\n", " - db_path: 
data\\GraphDB\\edges\\material_element_has\n", "------------------------------------------------------------\n", "\n", "############################################################\n", "NODE GENERATOR DETAILS\n", "############################################################\n", "Total node generators: 1\n", "------------------------------------------------------------\n", "• Generator: element\n", "Generator Args:\n", " - generator_func: []\n", " - generator_kwargs.base_file: [WindowsPath('data/elements.parquet')]\n", " - generator_name: ['element']\n", " - id: [0]\n", "Generator Kwargs:\n", " - base_file: [WindowsPath('data/elements.parquet')]\n", "------------------------------------------------------------\n", "\n", "############################################################\n", "EDGE GENERATOR DETAILS\n", "############################################################\n", "Total edge generators: 1\n", "------------------------------------------------------------\n", "• Generator: material_element_has\n", "Generator Args:\n", " - element_store: data\\GraphDB\\nodes\\element\n", " - material_store: data\\GraphDB\\nodes\\material\n", "Generator Kwargs:\n", "------------------------------------------------------------\n", "\n" ] } ], "source": [ "print(db)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Summary\n", "\n", "In this notebook, we showed how to define custom node and edge generators, add them to a `ParquetGraphDB` instance, and run them. We also saw that deleting a material node propagates to the dependent `material_element_has` edges." ] } ], "metadata": { "kernelspec": { "display_name": "parquetdb", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.17" }, "nbsphinx": { "execute": "never" } }, "nbformat": 4, "nbformat_minor": 2 }